Cross-validation experiments with networks
pathpy provides basic support for evaluations based on
cross-validation experiments. In particular, the train_test_split
method can be used to create train and test splits. The semantics of the
method, as well as its arguments, are similar to the corresponding
function in
`sklearn <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html>`__.
To demonstrate its use, we generate a random graph:
import pathpy as pp
n = pp.generators.ER_np(100, 0.04)
print(n)
n.plot()
Uid: 0x7f8c70080d30
Type: Network
Directed: False
Multi-Edges: False
Number of nodes: 100
Number of edges: 215
To generate test and train network instances, where the test network contains a random 25% of the nodes, we can write:
test, train = pp.algorithms.evaluation.train_test_split(n, test_size=0.25)
print(test)
print(train)
Uid: 0x7f8c70080d30_test
Type: Network
Directed: False
Multi-Edges: False
Number of nodes: 25
Number of edges: 19
Uid: 0x7f8c70080d30_train
Type: Network
Directed: False
Multi-Edges: False
Number of nodes: 75
Number of edges: 124
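What this node-based split does can be sketched with the standard library alone. This is a toy illustration of the sampling logic, not pathpy's implementation; the small edge list below is hypothetical and merely stands in for the network above:

```python
import random

random.seed(42)

# Hypothetical toy edge list standing in for the random graph above.
edges = [(i, (i + 1) % 12) for i in range(12)] + [(0, 6), (3, 9)]
nodes = sorted({v for e in edges for v in e})

# Sample 25% of the nodes into the test set; the rest form the train set.
test_nodes = set(random.sample(nodes, k=len(nodes) // 4))
train_nodes = set(nodes) - test_nodes

# An edge survives only if BOTH endpoints fall into the same part,
# so edges crossing the split are dropped.
test_edges = [e for e in edges if set(e) <= test_nodes]
train_edges = [e for e in edges if set(e) <= train_nodes]

lost = len(edges) - len(test_edges) - len(train_edges)
print(f"{len(edges)} edges total: {len(test_edges)} test, "
      f"{len(train_edges)} train, {lost} lost to the split")
```

Note how the test and train edge counts need not add up to the original number of edges, matching the 19 + 124 < 215 seen above.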
The method generates two new Network instances that refer to the same
node and edge objects as the original network, i.e. the new instances
consume little additional memory. The original network instance is not
changed. The uids of the newly generated networks are set to the
original uid with the suffixes _test and _train, respectively.
By default, the split is made based on the nodes: each of the two
networks contains its sampled node set along with all edges whose
endpoints both lie in that set. This implies that edges crossing the
split, i.e. edges with one endpoint in the training set and the other in
the test set, are lost. To preserve the number of edges, we can set the
split method to edge. This samples a random fraction of the edges, while
all nodes are added to both networks, i.e. the node sets of the two
networks are identical. The sum of the edges of the training and test
networks then equals the number of edges in the original network.
test, train = pp.algorithms.evaluation.train_test_split(n, test_size=0.25, split='edge')
print(test)
print(train)
Uid: 0x7f8c70080d30_test
Type: Network
Directed: False
Multi-Edges: False
Number of nodes: 100
Number of edges: 53
Uid: 0x7f8c70080d30_train
Type: Network
Directed: False
Multi-Edges: False
Number of nodes: 100
Number of edges: 162
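The edge-based split can be sketched in the same spirit. Again, this is a toy illustration with a hypothetical edge list, not pathpy's code:

```python
import random

random.seed(0)

# Hypothetical toy edge list: the complete graph on 10 nodes (45 edges).
edges = [(i, j) for i in range(10) for j in range(i + 1, 10)]
nodes = {v for e in edges for v in e}

# Sample 25% of the edges into the test set; the remainder trains.
k = round(len(edges) * 0.25)
test_edges = set(random.sample(edges, k=k))
train_edges = [e for e in edges if e not in test_edges]

# Both parts keep the full node set, and no edge is lost:
print(len(test_edges) + len(train_edges) == len(edges))  # True
```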
We can alternatively set the size of the training set:
test, train = pp.algorithms.evaluation.train_test_split(n, train_size=0.25, split='edge')
print(test)
print(train)
Uid: 0x7f8c70080d30_test
Type: Network
Directed: False
Multi-Edges: False
Number of nodes: 100
Number of edges: 161
Uid: 0x7f8c70080d30_train
Type: Network
Directed: False
Multi-Edges: False
Number of nodes: 100
Number of edges: 54
Apart from static networks, we can also create cross-validation sets for temporal networks. For this, we first load a temporal network from the KONECT database:
tn = pp.io.konect.read_konect_name('sociopatterns-hypertext')
print(tn)
tn.plot()
Uid: 0x7f8c9139e280
Type: TemporalNetwork
Directed: False
Multi-Edges: True
Number of unique nodes: 113
Number of unique edges: 2196
Number of temp nodes: 113
Number of temp edges: 20818
Observation periode: 1246255220 - 1246467561.0
Network attributes
------------------
category: HumanContact
code: HY
name: Hypertext 2009
description: Visitor–visitor face-to-face contacts
extr: sociopatterns
url: http://www.sociopatterns.org/
long-description: This is the network of face-to-face contacts of the attendees of the ACM Hypertext 2009 conference. The ACM Conference on Hypertext and Hypermedia 2009 (HT 2009, http://www.ht2009.org/) was held in Turin, Italy over three days from June 29 to July 1, 2009. In the network, a node represents a conference visitor, and an edge represents a face-to-face contact that was active for at least 20 seconds. Multiple edges denote multiple contacts. Each edge is annotated with the time at which the contact took place.
entity-names: visitor
relationship-names: contact
cite: konect:sociopatterns
time: 2009-06-29/2009-07-01
timeiso: 2009-06-29/2009-07-01
We can call the same function on a temporal network instance. By default, the split is made based on the observed interactions, i.e. in the following example the first 75% of all time-stamped interactions are included in the training network, while the last 25% are included in the test network.
test, train = pp.algorithms.evaluation.train_test_split(tn, test_size=0.25)
print(train)
print(test)
Uid: 0x7f8c9139e280_train
Type: TemporalNetwork
Directed: False
Multi-Edges: True
Number of unique nodes: 112
Number of unique edges: 1854
Number of temp nodes: 112
Number of temp edges: 15614
Observation periode: 1246255220 - 1246441061.0
Uid: 0x7f8c9139e280_test
Type: TemporalNetwork
Directed: False
Multi-Edges: True
Number of unique nodes: 95
Number of unique edges: 713
Number of temp nodes: 95
Number of temp edges: 5204
Observation periode: 1246441080 - 1246467561.0
train.plot()
test.plot()
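The interaction-based split described above amounts to ordering the time-stamped edges and cutting the sequence at the desired fraction. A minimal sketch with hypothetical data, not pathpy's implementation:

```python
# Time-stamped edges as (source, target, timestamp) tuples (hypothetical data).
temporal_edges = [
    ("a", "b", 10), ("b", "c", 12), ("a", "c", 15),
    ("c", "d", 20), ("a", "b", 22), ("b", "d", 30),
    ("a", "d", 31), ("c", "d", 35),
]

# Sort by timestamp and put the first 75% of interactions into the
# training set; the remaining 25% form the test set.
ordered = sorted(temporal_edges, key=lambda e: e[2])
cut = round(len(ordered) * 0.75)
train_edges, test_edges = ordered[:cut], ordered[cut:]

# Every training interaction precedes every test interaction.
print(len(train_edges), len(test_edges))  # 6 2
```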
We can also split based on the observed time, i.e. here we include all interactions occurring within the first 75% of the observed time period in the training network, while the remaining interactions are included in the test network.
test, train = pp.algorithms.evaluation.train_test_split(tn, test_size=0.25, split='time')
print(train)
print(test)
Uid: 0x7f8c9139e280_train
Type: TemporalNetwork
Directed: False
Multi-Edges: True
Number of unique nodes: 113
Number of unique edges: 2196
Number of temp nodes: 113
Number of temp edges: 20815
Observation periode: 1246255220 - 1246467541.0
Uid: 0x7f8c9139e280_test
Type: TemporalNetwork
Directed: False
Multi-Edges: True
Number of unique nodes: 5
Number of unique edges: 3
Number of temp nodes: 5
Number of temp edges: 3
Observation periode: 1246467560 - 1246467561.0
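The time-based split instead cuts the observation period itself, so the two parts can be very unbalanced when activity is bursty, as the 20815 / 3 split above shows. A minimal sketch with the same hypothetical data as before:

```python
# Time-stamped edges as (source, target, timestamp) tuples (hypothetical data).
temporal_edges = [
    ("a", "b", 10), ("b", "c", 12), ("a", "c", 15),
    ("c", "d", 20), ("a", "b", 22), ("b", "d", 30),
    ("a", "d", 31), ("c", "d", 35),
]

timestamps = [t for _, _, t in temporal_edges]
start, end = min(timestamps), max(timestamps)

# Cut the observation period itself: the first 75% of the observed
# time span goes to training, the rest to the test set.
cutoff = start + 0.75 * (end - start)  # 10 + 0.75 * 25 = 28.75
train_edges = [e for e in temporal_edges if e[2] <= cutoff]
test_edges = [e for e in temporal_edges if e[2] > cutoff]

# Unlike the interaction-count split, the parts need not contain
# 75% and 25% of the interactions.
print(len(train_edges), len(test_edges))  # 5 3
```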
